Roman Shrestha
December 9, 2022
Motivation
In September 2019, the Census Bureau reported that income inequality in the United States had reached its highest level in 50 years.
Research Question
Is there a relationship between a person’s income and their educational background and capital gain in the United States?
Hypothesis
As a person’s education level and capital gain increase, their income will also increase.
The dataset we use was extracted by Barry Becker from the 1994 US Census Bureau database and was later donated to the Machine Learning Repository of the University of California, Irvine, as the Census Income Data Set.
32,561 instances
15 attributes, including a person’s education level, race, and capital gain
mix of continuous and discrete data
key explanatory variables: education, education_num, and capital_gain
outcome variable: income
control variables: marital status, occupation, race, and sex
# A tibble: 6 × 16
age workclass fnlwgt educa…¹ educa…² marit…³ occup…⁴ relat…⁵ race sex
<dbl> <chr> <dbl> <fct> <dbl> <fct> <chr> <chr> <fct> <fct>
1 50 Self-emp-not… 83311 bachel… 13 Marrie… Exec-m… Husband White Male
2 38 Private 215646 high-s… 9 Divorc… Handle… Not-in… White Male
3 53 Private 234721 some-h… 7 Marrie… Handle… Husband Black Male
4 28 Private 338409 bachel… 13 Marrie… Prof-s… Wife Black Fema…
5 37 Private 284582 masters 14 Marrie… Exec-m… Wife White Fema…
6 49 Private 160187 some-h… 5 Marrie… Other-… Not-in… Black Fema…
# … with 6 more variables: capital_gain <dbl>, capital_loss <dbl>,
# hours_per_week <dbl>, native_country <fct>, income <fct>, income_bin <dbl>,
# and abbreviated variable names ¹education, ²education_num, ³marital_status,
# ⁴occupation, ⁵relationship
Histogram of Education and Income
Bar plot of Education and Income
Density plot of Capital gain vs Income
Residual plot
Summary statistics
| Education | Income Group | Mean of Education Years | Median of Education Years | SD of Education Years | Mean of Capital Gain | SD of Capital Gain |
|---|---|---|---|---|---|---|
| some_primary_middle_school | <=50K | 3.301 | 4 | 0.875 | 172.493 | 1413.253 |
| some_primary_middle_school | >50K | 3.548 | 4 | 0.670 | 1302.468 | 2718.439 |
| some-hs-school | <=50K | 6.497 | 7 | 0.932 | 108.069 | 854.967 |
| some-hs-school | >50K | 6.544 | 7 | 0.955 | 3398.736 | 13155.117 |
| high-school-grad | <=50K | 9.000 | 9 | 0.000 | 153.879 | 986.443 |
| high-school-grad | >50K | 9.000 | 9 | 0.000 | 2805.281 | 12044.407 |
| some-college | <=50K | 10.000 | 10 | 0.000 | 125.685 | 825.845 |
| some-college | >50K | 10.000 | 10 | 0.000 | 2612.823 | 10605.908 |
| associate | <=50K | 11.440 | 11 | 0.497 | 158.799 | 776.956 |
| associate | >50K | 11.423 | 11 | 0.494 | 2207.693 | 7035.267 |
| bachelors | <=50K | 13.000 | 13 | 0.000 | 162.260 | 837.939 |
| bachelors | >50K | 13.000 | 13 | 0.000 | 4004.705 | 14024.036 |
| masters | <=50K | 14.000 | 14 | 0.000 | 285.775 | 1792.404 |
| masters | >50K | 14.000 | 14 | 0.000 | 4376.397 | 14284.895 |
| prof-school | <=50K | 15.000 | 15 | 0.000 | 186.954 | 751.534 |
| prof-school | >50K | 15.000 | 15 | 0.000 | 14113.712 | 30740.170 |
| doctorate | <=50K | 16.000 | 16 | 0.000 | 220.486 | 899.842 |
| doctorate | >50K | 16.000 | 16 | 0.000 | 6361.039 | 19783.396 |
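
The grouped statistics in the table above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used for the analysis; the `records` list and the `summarize` helper are hypothetical, and the values are made up for illustration rather than taken from the census data. It also shows why a group whose members all share one `education_num` value, such as bachelors, has a standard deviation of 0.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Hypothetical records: (education, income group, education_num, capital_gain).
# Illustrative values only, NOT rows from the census data.
records = [
    ("associate", "<=50K", 11, 0),
    ("associate", "<=50K", 12, 0),
    ("associate", "<=50K", 11, 300),
    ("bachelors", ">50K", 13, 0),
    ("bachelors", ">50K", 13, 5000),
    ("bachelors", ">50K", 13, 1000),
]


def summarize(rows):
    """Group rows by (education, income) and compute summary statistics."""
    groups = defaultdict(list)
    for education, income, edu_num, gain in rows:
        groups[(education, income)].append((edu_num, gain))

    summary = {}
    for key, values in groups.items():
        edu = [v[0] for v in values]
        gain = [v[1] for v in values]
        summary[key] = {
            "mean_edu": mean(edu),
            "median_edu": median(edu),
            # Sample standard deviation needs at least two observations.
            "sd_edu": stdev(edu) if len(edu) > 1 else 0.0,
            "mean_gain": mean(gain),
            "sd_gain": stdev(gain) if len(gain) > 1 else 0.0,
        }
    return summary


summary = summarize(records)
for key, stats in summary.items():
    print(key, stats)
```

A group that mixes more than one `education_num` value, as the hypothetical associate rows do here, gets a nonzero standard deviation for education years.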
Linear Model
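Each additional dollar of capital gain raises the estimated probability of higher income (>50K) by about 0.0000083, or roughly 0.83 percentage points per $1,000.
Each step up in education level raises the estimated probability of higher income: the education coefficients grow from 0.015 for some-hs-school to 0.407 for doctorate.
The slopes for all of our explanatory variables are positive, which suggests a positive linear association between the explanatory variables and the outcome variable, although the coefficients for some-hs-school and high-school-grad are not statistically significant at the 5% level (p = 0.47 and p = 0.33).
However, the residual plot shows that the relationship between our variables is not linear.
A model other than linear regression may therefore explain the relationship better.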
| Term | Estimate | Std Error | Statistic | P Value |
|---|---|---|---|---|
| (Intercept) | -0.1616961 | 0.0223596 | -7.2316260 | 0.0000000 |
| capital_gain | 0.0000083 | 0.0000003 | 30.7411302 | 0.0000000 |
| educationsome-hs-school | 0.0153229 | 0.0211389 | 0.7248664 | 0.4685392 |
| educationhigh-school-grad | 0.0319746 | 0.0328280 | 0.9740051 | 0.3300613 |
| educationsome-college | 0.0797344 | 0.0381008 | 2.0927212 | 0.0363818 |
| educationassociate | 0.0919037 | 0.0460746 | 1.9946700 | 0.0460872 |
| educationbachelors | 0.2077272 | 0.0541460 | 3.8364290 | 0.0001251 |
| educationmasters | 0.3052909 | 0.0599616 | 5.0914368 | 0.0000004 |
| educationprof-school | 0.3557377 | 0.0664992 | 5.3495011 | 0.0000001 |
| educationdoctorate | 0.4070083 | 0.0723654 | 5.6243496 | 0.0000000 |
| education_num | 0.0138270 | 0.0055156 | 2.5068732 | 0.0121853 |
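
To make the coefficients in the table above concrete, the sketch below plugs them into the fitted linear equation. It is an illustration under stated assumptions, not the code used for the analysis: it assumes the outcome is the 0/1 `income_bin` column, that `some_primary_middle_school` is the baseline education level (it has no row in the table), and the `fitted_value` helper and the example inputs are hypothetical.

```python
# Coefficients copied from the regression table above.
INTERCEPT = -0.1616961
CAPITAL_GAIN = 0.0000083
EDUCATION_NUM = 0.0138270
EDUCATION = {
    # Assumed baseline level: it has no row in the table, so its offset is 0.
    "some_primary_middle_school": 0.0,
    "some-hs-school": 0.0153229,
    "high-school-grad": 0.0319746,
    "some-college": 0.0797344,
    "associate": 0.0919037,
    "bachelors": 0.2077272,
    "masters": 0.3052909,
    "prof-school": 0.3557377,
    "doctorate": 0.4070083,
}


def fitted_value(education, education_num, capital_gain):
    """Fitted value of the linear model for one (hypothetical) person."""
    return (
        INTERCEPT
        + EDUCATION[education]
        + EDUCATION_NUM * education_num
        + CAPITAL_GAIN * capital_gain
    )


# Hypothetical inputs chosen for illustration only.
typical = fitted_value("bachelors", 13, 0)
low = fitted_value("some_primary_middle_school", 1, 0)
high = fitted_value("prof-school", 15, 100_000)

for label, value in [("typical", typical), ("low", low), ("high", high)]:
    inside = 0.0 <= value <= 1.0
    print(f"{label}: {value:.4f} (valid probability: {inside})")
```

Under these assumptions the `low` input yields a fitted value below 0 and the `high` input yields one above 1, so the linear equation does not always return a valid probability. This is consistent with the concern raised by the residual plot.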
Hypothesis Testing
Null hypothesis: there is no association between income and education
Alternative hypothesis: there is an association between income and education
We use a chi-squared test of independence to test this hypothesis
The chi-squared statistic is 4427 and the p-value is negligible
We reject the null hypothesis
There is an association between income and education level
X-squared = 4426.764
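
For readers unfamiliar with the test, the statistic compares the observed counts in an education-by-income contingency table with the counts expected if the two variables were independent. The sketch below is a minimal standard-library illustration, not the code used for the analysis; the `chi_squared` helper is hypothetical and the 2 × 2 `toy_table` is made up, since the census counts are not reproduced here.

```python
def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom.

    `table` is a list of rows of observed counts (a contingency table).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Made-up counts for illustration only, NOT the census data.
toy_table = [
    [10, 20],
    [20, 10],
]
statistic, dof = chi_squared(toy_table)
print(f"chi-squared = {statistic:.3f}, degrees of freedom = {dof}")
```

If the test was run on the nine education levels and two income groups listed in the summary statistics, the table would have (9 - 1) × (2 - 1) = 8 degrees of freedom.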
Our statistical analysis supports the hypothesis that as a person’s education level and capital gain increase, their income also increases.
The results from the linear regression, the hypothesis test, and the EDA all point to a positive association between education, capital gain, and income.
This research only provides a glimpse into the factors that influenced a person’s income in 1994, nearly three decades ago, and does not necessarily apply to the present.
We recognize that we do not yet have the coding experience to properly fit the non-linear models that the regression analysis suggests we need.
We encourage future research on the factors that influence a person’s (in)accessibility to education, since our results show that education is a key indicator of income.